Caffeine for the Force Project

Background Info

Mrs. Jones, the PM in charge of this project, requests that you analyze this survey data, come up with actionable insights, and if there is monetization potential there, build a product that makes use of said insights, so as to optimize the corresponding revenue stream. Afterwards, you are to report to the stakeholders (Mrs. Jones' team) and provide arguments as to why this project is worth the CMO's time and resources.

According to Mr. Patel, the PM involved in the development and the deployment of the questionnaires involved, the people replying to them were a typical sub-group of the users of the company's site. However, some of them didn't reply to all of the questions, while the background information of some of the users is incomplete. Josie, the analyst who gathered all the data says that this shouldn't be an issue since the sample is large enough (and quite representative of the whole population). She went on to save this data in a .csv file and send it to you with a comment that all missing values are marked as "-".

Best of luck to you in your analysis, and may the Force be with you!

Data Engineering

Data Preparation, Data Representation

In [1]:
path = pwd() # get the current path for use later on
Out[1]:
"C:\\Users\\Zacharias\\Documents\\Julia"
In [2]:
data_location = "d:\\data"
Out[2]:
"d:\\data"
In [3]:
if path != data_location
    cd(data_location)
end
In [4]:
data = readdlm("CaffeineForTheForce.csv", ',')
Out[4]:
101x8 Array{Any,2}:
    "ID"    "MoviesWatched"  "FavoriteEpisode"  …  "FavoriteProduct"
   1        "-"              "VI"                  "VideoGame"      
   2       3                 "VI"                  "Book"           
   3      44                 "IV"                  "Book"           
   4       3                 "-"                   "Book"           
   5       7                 "VI"               …  "App"            
   6       7                 "IV"                  "VideoGame"      
   7       4                 "VI"                  "Book"           
   8       6                 "III"                 "VideoGame"      
   9       7                 "IV"                  "App"            
  10       7                 "IV"               …  "Book"           
  11       5                 "-"                   "None"           
  12       7                 "VII"                 "App"            
   ⋮                                            ⋱                   
  89       7                 "VII"                 "None"           
  90       4                 "VI"               …  "Book"           
  91       6                 "IV"                  "App"            
  92       4                 "VI"                  "VideoGame"      
  93       7                 "VI"                  "App"            
  94       7                 "-"                   "Book"           
  95       5                 "V"                …  "Book"           
  96       3                 "V"                   "None"           
  97       4                 "IV"                  "Book"           
  98       3                 "VI"                  "Book"           
  99       6                 "III"                 "VideoGame"      
 100        "-"              "IV"               …  "Book"           

Hm, it looks like there are some missing values in the MoviesWatched variable. Let's see how big this issue is across the dataset.

In [5]:
N, n = size(data)
Out[5]:
(101,8)
In [6]:
var_names = data[1,:]
data = data[2:end,:]
N -= 1 # no need to count the header as a data point
Out[6]:
100
In [7]:
for i = 1:n
    println(var_names[i], '\t', sum(data[:,i] .== "-"))
end
ID	0
MoviesWatched	10
FavoriteEpisode	10
Age	10
Gender	10
BlogRole	10
MoneySpent	0
FavoriteProduct	0

In order to handle the various variable types more effectively, let's put them all in a data frame.

In [8]:
using DataFrames # need to load the corresponding package first. Note that we could have loaded the data using the readtable() command from this package, if we wanted.
In [9]:
df = DataFrame(data)

for i = 1:n
    cn = symbol(string("x", i))
    nn = symbol(var_names[i])
    rename!(df, cn, nn)
end
In [10]:
head(df)
Out[10]:
IDMoviesWatchedFavoriteEpisodeAgeGenderBlogRoleMoneySpentFavoriteProduct
11-VI-MActiveUser1050VideoGame
223VI45-NormalUser1000Book
3344IV33F-650Book
443-40MModerator11000Book
557VI25FNormalUser100App
667IV29MActiveUser250VideoGame

Looks like we have a few outliers as well (cases 3, 4, and possibly more). Between these two extreme users that are visible in the above preview, the former has watched 44 movies (even though she is still a toddler!) while the latter has spent more than 10x as much money than the next high-spender. The Force must be strong with these two! Alternatively, these cases could be just typos (in the first case that's almost certain since at the time of this writing there are only 7 movies out there). Either way, since Jedis are non common among the company's clientele, it would be best to deal with thess outliers by either removing them, or substituting its value for something that would make more sense. In the meantime, we can use median instead of mean for handling the missing values of the numeric features (since this metric is not influenced by outliers that much).

However, in order to do the missing value handling more efficiently, it would make sense to do some data representation at this point. So, from the above preview we can deduce that the variables of this dataset can be encoded as Int64, Int64, AbstractString, Int64, AbstractString, AbstractString, Float64, and AbstractString, at least for the time being. As Julia doesn't want to make any assumptions about the data types, everything is encoded as the most generic data type, Any. Also, even though we won't be using variable "ID" as a feature, it is useful to encode it as an integer, since we may need to reference it.

To make this whole type conversion process simpler, let's put all the desired data types in a single array first.

In [11]:
dt = [Int64, Int64, AbstractString, Int64, AbstractString, AbstractString, Float64, AbstractString]
Out[11]:
8-element Array{DataType,1}:
 Int64         
 Int64         
 AbstractString
 Int64         
 AbstractString
 AbstractString
 Float64       
 AbstractString

As most of our variables have missing values, it would be best to convert them afterwards. In the meantime we can work with the target variables, MoneySpent and FavorityProduct, as well as the ID variable.

In [12]:
df[:MoneySpent] = convert(Array{dt[7],1}, df[:MoneySpent])
df[:FavoriteProduct] = convert(Array{dt[8],1}, df[:FavoriteProduct])
df[:ID] = convert(Array{dt[1],1}, df[:ID])
Out[12]:
100-element Array{Int64,1}:
   1
   2
   3
   4
   5
   6
   7
   8
   9
  10
  11
  12
  13
   ⋮
  89
  90
  91
  92
  93
  94
  95
  96
  97
  98
  99
 100

Since our main task with this data is to classify each case into a product that they like (FavorieProduct variable), it makes sense to look at each variable on the basis of each class and use that information to fill in the missing values.

In [13]:
Q = unique(df[:FavoriteProduct])
m = length(Q)
println(Q, '\n', m)
AbstractString["VideoGame","Book","App","None"]
4
In [14]:
C = Array(Array{Int64,1}, 4) # initialize class array (an array of arrays)
Out[14]:
4-element Array{Array{Int64,1},1}:
 #undef
 #undef
 #undef
 #undef
In [15]:
for i = 1:m
    C[i] = collect(1:N)[df[:FavoriteProduct] .== Q[i]] # indexes of class i
end
In [16]:
function FindNonMissingElements(X::Array{Any,1}, ind::Array{Int64,1})
    # finds all non-missing elements of array X that have an index belonging to ind
    Z = []
    
    for x in X[ind]
        if x != "-"
            push!(Z, x)
        end
    end
    
    return Z
end
Out[16]:
FindNonMissingElements (generic function with 1 method)
In [17]:
for i = 1:N
    for j = 1:5
        var_name = symbol(var_names[j+1])
        
        if df[var_name][i] == "-"
            c = [1,2,3,4][Q .== df[:FavoriteProduct][i]][1] # class of current case
            X = FindNonMissingElements(df[var_name], C[c]) # non-missing elements of feature j, for the particular class of case i
            T = dt[j+1] # type of variable j      
            
            if T <: Number
                df[var_name][i] = round(T, median(X)) # since all of our feature variables are of type Int64, it makes sense to keep them this way
            else
                df[var_name][i] = mode(X)
            end
        end
    end
end

Now all our missing values should have been taken care of. Let's confirm this by taking another look at the first part of our data frame.

In [18]:
head(df, 10)
Out[18]:
IDMoviesWatchedFavoriteEpisodeAgeGenderBlogRoleMoneySpentFavoriteProduct
116VI30MActiveUser1050.0VideoGame
223VI45MNormalUser1000.0Book
3344IV33FNormalUser650.0Book
443VI40MModerator11000.0Book
557VI25FNormalUser100.0App
667IV29MActiveUser250.0VideoGame
774VI37MModerator700.0Book
886III20MActiveUser350.0VideoGame
997IV27MActiveUser450.0App
10107IV30FNormalUser350.0Book

Looks like everything is clean. Let's now finish encoding the variables into the most appropriate types

In [19]:
for i = 2:6
    var_name = symbol(var_names[i])
    df[var_name] = convert(Array{dt[i],1}, df[var_name])
end
In [20]:
typeof(df[:Age]) # check one of the variables to make sure it's of the correct type
Out[20]:
DataArrays.DataArray{Int64,1}

Let's now delve deeper into our data and see what else we can do to improve its quality.

Data Exploration, Data Representation

Since this dataset is quite simple, we'll tackle it using just plots, although a stats-based approach would also work.

In [21]:
using Gadfly # need to load a visualization package first

Let's first start with a few histograms. Firstly, let's examine the target variables since these are of the highest importance.

In [22]:
plot(df, x = "FavoriteProduct", Geom.histogram)
Out[22]:
FavoriteProduct VideoGame Book App None -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 -50 -48 -46 -44 -42 -40 -38 -36 -34 -32 -30 -28 -26 -24 -22 -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 -50 0 50 100 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100

It looks like most of our users are into apps, followed closely by the book fans. Very few people seem to not care about SW products at all. How much are the spending on this stuff though? Another histogram may shed some light on this question.

In [23]:
plot(df, x = "MoneySpent", Geom.histogram)
Out[23]:
MoneySpent -2.0×10⁴ -1.5×10⁴ -1.0×10⁴ -5.0×10³ 0 5.0×10³ 1.0×10⁴ 1.5×10⁴ 2.0×10⁴ 2.5×10⁴ 3.0×10⁴ 3.5×10⁴ -1.50×10⁴ -1.45×10⁴ -1.40×10⁴ -1.35×10⁴ -1.30×10⁴ -1.25×10⁴ -1.20×10⁴ -1.15×10⁴ -1.10×10⁴ -1.05×10⁴ -1.00×10⁴ -9.50×10³ -9.00×10³ -8.50×10³ -8.00×10³ -7.50×10³ -7.00×10³ -6.50×10³ -6.00×10³ -5.50×10³ -5.00×10³ -4.50×10³ -4.00×10³ -3.50×10³ -3.00×10³ -2.50×10³ -2.00×10³ -1.50×10³ -1.00×10³ -5.00×10² 0 5.00×10² 1.00×10³ 1.50×10³ 2.00×10³ 2.50×10³ 3.00×10³ 3.50×10³ 4.00×10³ 4.50×10³ 5.00×10³ 5.50×10³ 6.00×10³ 6.50×10³ 7.00×10³ 7.50×10³ 8.00×10³ 8.50×10³ 9.00×10³ 9.50×10³ 1.00×10⁴ 1.05×10⁴ 1.10×10⁴ 1.15×10⁴ 1.20×10⁴ 1.25×10⁴ 1.30×10⁴ 1.35×10⁴ 1.40×10⁴ 1.45×10⁴ 1.50×10⁴ 1.55×10⁴ 1.60×10⁴ 1.65×10⁴ 1.70×10⁴ 1.75×10⁴ 1.80×10⁴ 1.85×10⁴ 1.90×10⁴ 1.95×10⁴ 2.00×10⁴ 2.05×10⁴ 2.10×10⁴ 2.15×10⁴ 2.20×10⁴ 2.25×10⁴ 2.30×10⁴ 2.35×10⁴ 2.40×10⁴ 2.45×10⁴ 2.50×10⁴ 2.55×10⁴ 2.60×10⁴ 2.65×10⁴ 2.70×10⁴ 2.75×10⁴ 2.80×10⁴ 2.85×10⁴ 2.90×10⁴ 2.95×10⁴ 3.00×10⁴ -2×10⁴ 0 2×10⁴ 4×10⁴ -1.5×10⁴ -1.4×10⁴ -1.3×10⁴ -1.2×10⁴ -1.1×10⁴ -1.0×10⁴ -9.0×10³ -8.0×10³ -7.0×10³ -6.0×10³ -5.0×10³ -4.0×10³ -3.0×10³ -2.0×10³ -1.0×10³ 0 1.0×10³ 2.0×10³ 3.0×10³ 4.0×10³ 5.0×10³ 6.0×10³ 7.0×10³ 8.0×10³ 9.0×10³ 1.0×10⁴ 1.1×10⁴ 1.2×10⁴ 1.3×10⁴ 1.4×10⁴ 1.5×10⁴ 1.6×10⁴ 1.7×10⁴ 1.8×10⁴ 1.9×10⁴ 2.0×10⁴ 2.1×10⁴ 2.2×10⁴ 2.3×10⁴ 2.4×10⁴ 2.5×10⁴ 2.6×10⁴ 2.7×10⁴ 2.8×10⁴ 2.9×10⁴ 3.0×10⁴ -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 -50 -48 -46 -44 -42 -40 -38 -36 -34 -32 -30 -28 -26 -24 -22 -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 -50 0 50 100 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100

So, there is someone who has spent over $10000 on SW merchantise. Let's take a closer look at this person's data.

In [24]:
ind = collect(1:N)[df[:MoneySpent] .> 10000]
df[ind,:]
Out[24]:
IDMoviesWatchedFavoriteEpisodeAgeGenderBlogRoleMoneySpentFavoriteProduct
143VI40MModerator11000.0Book

Well, this man seems like a very big fan (for one thing, he is a moderator in our blog) and he has been on this planet for 4 decades, but it's doubtful he spent that much money (especially considering the products sold are not that expensive). Even if he did, such a data point is bound to create some issues in our model later on (where we'll try to predict the money spent, based on the first five features), so it's best to deal with this outlier. One good strategy for this is to find the most similar user to him and copy that person's MoneySpent value. Let's start by looking at other moderators out there (there can't be that many, since it's a quite specialized role).

In [25]:
ind = df[:BlogRole] .== "Moderator"
df[ind, :]
Out[25]:
IDMoviesWatchedFavoriteEpisodeAgeGenderBlogRoleMoneySpentFavoriteProduct
143VI40MModerator11000.0Book
274VI37MModerator700.0Book
3397V40MModerator600.0Book
4767IV42MModerator600.0App
5855IV30MModerator500.0App

It looks like there is a very similar user (the one with ID = 7) who has watched about the same number of movies, is also a big fan of episode VI, is around the same age, is also a male moderator, and is fond of SW books too. It wouldn't be a stretch to copy this person's MoneySpent data to our outlier.

In [26]:
df[ind, :MoneySpent] = df[7, :MoneySpent]
Out[26]:
700.0

Now let's take a look at the actual histogram of the MoneySpent variable.

In [27]:
plot(df, x = "MoneySpent", Geom.histogram)
Out[27]:
MoneySpent -2000 -1500 -1000 -500 0 500 1000 1500 2000 2500 3000 3500 -1500 -1450 -1400 -1350 -1300 -1250 -1200 -1150 -1100 -1050 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 2050 2100 2150 2200 2250 2300 2350 2400 2450 2500 2550 2600 2650 2700 2750 2800 2850 2900 2950 3000 -2000 0 2000 4000 -1500 -1400 -1300 -1200 -1100 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 -25 -24 -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 -25 0 25 50 -26 -24 -22 -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50

Let us now continue our exploration through the remaining variables (our features), omitting the ID one since it is just an identifier.

In [28]:
plot(df, x = "MoviesWatched", Geom.histogram)
Out[28]:
MoviesWatched -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -100 0 100 200 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 -40 -38 -36 -34 -32 -30 -28 -26 -24 -22 -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 -50 0 50 100 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80

Hm. It seems that the outlier we identified earlier wasn't the only case. There is another person who has watched close to 80 movies! So, unless we are dealing with a Jedi master who is using the Jedi mind trick on us, this record is unreliable and needs to be processed. Before we do anything, let's take a closer look at these outliers (basically anyone who has watched more than 7 movies):

In [29]:
ind = collect(1:N)[df[:MoviesWatched] .> 7]
df[:MoviesWatched][ind]
Out[29]:
2-element DataArrays.DataArray{Int64,1}:
 44
 77

It appears that whoever was typing this data had limited control over the motor functions of their fingers. So, a reasonable assumption is that these outliers correspond to 4 and 7 respectively. If we were not sure about what the actual values were, we could treat them as missing values (go back to Data Preparation). Also, if we had several more outliers, we should change them first and then treat the missing values anew, to ensure that the median value we are using is not thrown off by the outliers.

In [30]:
df[:MoviesWatched][ind] = [4, 7]
Out[30]:
2-element Array{Int64,1}:
 4
 7

Now let's see what the actual histogram of this variable looks like.

In [31]:
plot(df, x = "MoviesWatched", Geom.histogram)
Out[31]:
MoviesWatched -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8 11.0 -5 0 5 10 15 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 -50 -48 -46 -44 -42 -40 -38 -36 -34 -32 -30 -28 -26 -24 -22 -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 -50 0 50 100 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100

No surprises here. Let's look at the FavoriteEpisode variable though.

In [32]:
plot(df, x = "FavoriteEpisode", Geom.histogram)
Out[32]:
FavoriteEpisode VI IV III V VII -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 -40 -38 -36 -34 -32 -30 -28 -26 -24 -22 -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 -50 0 50 100 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80

No outliers here (though the fact that someone liked episode III more than all of the others is somewhat suspicious!). Interestingly, no-one liked episodes I and II the most, so there is still hope for the SW fans of the years to come... What about the Age variable though?

In [33]:
plot(df, x = "Age", Geom.histogram)
Out[33]:
Age -200 -150 -100 -50 0 50 100 150 200 250 300 350 -150 -145 -140 -135 -130 -125 -120 -115 -110 -105 -100 -95 -90 -85 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180 185 190 195 200 205 210 215 220 225 230 235 240 245 250 255 260 265 270 275 280 285 290 295 300 -200 0 200 400 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 -12.5 -10.0 -7.5 -5.0 -2.5 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 22.5 -10.0 -9.5 -9.0 -8.5 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0 -10 0 10 20 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Hm. It looks like we have a centenial among our ranks! Moreover, that person seems to qualify for the Guiness Book of World Records. Maybe it's a Jedi Master following the steps of Yoda... Let's take a closer look at this person.

In [34]:
ind = collect(1:N)[df[:Age] .> 100]
df[ind, :]
Out[34]:
IDMoviesWatchedFavoriteEpisodeAgeGenderBlogRoleMoneySpentFavoriteProduct
1904VI145MActiveUser250.0Book

Definitely an old-timer, but it's doubtful this person is that old. It would be best to deal with this case as a missing value, since it's not feasible to accurately guess their actual age (he also belongs to a relatively common class of users). The fact that this person is into SW books (class 2) could help us provide a replacement value that won't affect our predictions of him, particularly the favorite product part. First, let's make his age a missing value.

In [35]:
df[ind,:Age] = 999 # since we can't use the "-" symbol for Int64 variables, we'll go ahead and give it an easily distinguishable value
Out[35]:
999

Before we continue though, it's best to take a look another percular cases, a user whose age is on the other end of the spectrum (perhaps a Jedi prodigy?).

In [36]:
ind = collect(1:N)[df[:Age] .< 10]
df[ind, :]
Out[36]:
IDMoviesWatchedFavoriteEpisodeAgeGenderBlogRoleMoneySpentFavoriteProduct
1874IV3MActiveUser300.0Book

Since there is no obvious way to replace this outlier in age, it would be best to also treat that user's age as a missing value.

In [37]:
df[ind, :Age] = 999
Out[37]:
999
In [38]:
ind = collect(1:N)[df[:Age] .== 999] # indexes of problematic cases (for the Age variable)
Out[38]:
2-element Array{Int64,1}:
 87
 90
In [39]:
for i in ind
    c = [1,2,3,4][Q .== df[:FavoriteProduct][i]][1] # class of current case
    nme = setdiff(C[c], ind)
    X = convert(Array,df[nme, :Age]) # non-missing elements of feature j, for the particular class of case i
    df[i, :Age] = round(Int64, median(X)) # since all of our feature variables are of type Int64, it makes sense to keep them this way
end

Now let's take a look at a more realistic histogram of the Age variable, before we continue to the Gender one.

In [40]:
plot(df, x = "Age", Geom.histogram)
Out[40]:
Age -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -100 0 100 200 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -20 -15 -10 -5 0 5 10 15 20 25 30 35 -15.0 -14.5 -14.0 -13.5 -13.0 -12.5 -12.0 -11.5 -11.0 -10.5 -10.0 -9.5 -9.0 -8.5 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0 20.5 21.0 21.5 22.0 22.5 23.0 23.5 24.0 24.5 25.0 25.5 26.0 26.5 27.0 27.5 28.0 28.5 29.0 29.5 30.0 -20 0 20 40 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
In [41]:
plot(df, x = "Gender", Geom.histogram)
Out[41]:
Gender M F -150 -100 -50 0 50 100 150 200 250 -100 -95 -90 -85 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180 185 190 195 200 -100 0 100 200 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200

Looks like the majority of our users are guys. Let's carry on to the next and final feature variable.

In [42]:
plot(df, x = "BlogRole", Geom.histogram)
Out[42]:
BlogRole ActiveUser NormalUser Moderator -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -100 0 100 200 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160

Nothing spectacular here. I'm pretty sure that distribution holds across most blogs. Let's delve a bit deeper though and see what relationships we can find among all these variables.

In [43]:
plot(x = df[2], y = df[3]) # relationship between MoviesWatched and FavoriteEpisode
Out[43]:
x -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8 11.0 -5 0 5 10 15 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 VI IV III V VII y

Not much to see here (which is good, since it shows that these two features are independent from each other).

In [44]:
plot(x = df[2], y = df[4]) # relationship between MoviesWatched and Age
Out[44]:
x -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8 11.0 -5 0 5 10 15 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -100 0 100 200 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 y

Nothing interesting here either. Apparently if someone is interested in the SW saga, age doesn't affect how many movies they watch.

In [45]:
plot(x = df[2], y = df[5]) # relationship between MoviesWatched and Gender
Out[45]:
x -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8 11.0 -5 0 5 10 15 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 M F y

Same here. Gender isn't a factor when it comes to how many SW films they are going to watch.

In [46]:
plot(x = df[2], y = df[6]) # relationship between MoviesWatched and BlogRole
Out[46]:
x -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8 11.0 -5 0 5 10 15 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 ActiveUser NormalUser Moderator y

There is a very weak signal here. It seems that if someone hasn't watched all 7 movies, they are bound to not be a moderator.

In [47]:
plot(x = df[2], y = df[7]) # relationship between MoviesWatched and MoneySpent
Out[47]:
x -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8 11.0 -5 0 5 10 15 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 -2000 -1500 -1000 -500 0 500 1000 1500 2000 2500 3000 3500 -1500 -1450 -1400 -1350 -1300 -1250 -1200 -1150 -1100 -1050 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 2050 2100 2150 2200 2250 2300 2350 2400 2450 2500 2550 2600 2650 2700 2750 2800 2850 2900 2950 3000 -2000 0 2000 4000 -1500 -1400 -1300 -1200 -1100 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 y

Not much to work with here either. Perhaps a faint signal connecting the number of movies watched with how much money they spend on SW stuff. It could be a fluke though.

In [48]:
plot(x = df[2], y = df[8]) # relationship between MoviesWatched and FavoriteProduct
Out[48]:
x -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8 11.0 -5 0 5 10 15 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 VideoGame Book App None y

Even though this is again a quite weak signal, it may be somewhat useful. Based on this, if someone has watched 6 movies, they are unlikely to be completely disinterested in any of the SW products.

In [49]:
plot(x = df[3], y = df[4]) # relationship between FavoriteEpisode and Age
Out[49]:
x VI IV III V VII -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -100 0 100 200 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 y

It looks like episode III appeals to a relatively younger audience (compared to episodes V and VI, for example). That's something worth exploring further (data discovery stage)

In [50]:
plot(x = df[3], y = df[5]) # relationship between FavoriteEpisode and Gender
Out[50]:
x VI IV III V VII M F y

Interestingly, none of our female users seemed to like episode VII the most.

In [51]:
plot(x = df[3], y = df[6]) # relationship between FavoriteEpisode and BlogRole
Out[51]:
x VI IV III V VII ActiveUser NormalUser Moderator y

Interestingly, those who like episodes III or VII the most, weren't blog moderators

In [52]:
plot(x = df[3], y = df[7]) # relationship between FavoriteEpisode and MoneySpent
Out[52]:
x VI IV III V VII -2000 -1500 -1000 -500 0 500 1000 1500 2000 2500 3000 3500 -1500 -1450 -1400 -1350 -1300 -1250 -1200 -1150 -1100 -1050 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 2050 2100 2150 2200 2250 2300 2350 2400 2450 2500 2550 2600 2650 2700 2750 2800 2850 2900 2950 3000 -2000 0 2000 4000 -1500 -1400 -1300 -1200 -1100 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 y

It seems that those who liked episode VI the most tend to have spent more on SW stuff, overall.

In [53]:
plot(x = df[3], y = df[8]) # relationship between FavoriteEpisode and FavoriteProduct
Out[53]:
x VI IV III V VII VideoGame Book App None y

It appears that those who liked episodes IV or III were definitely not apathetic when it came to SW products, while those who prefered episode VII over the others, were not so much into video games.

In [54]:
plot(x = df[4], y = df[5]) # relationship between Age and Gender
Out[54]:
x -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -100 0 100 200 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 M F y

Interestingly most of our older users are men.

In [55]:
plot(x = df[4], y = df[6]) # relationship between Age and BlogRole
Out[55]:
x -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -100 0 100 200 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 ActiveUser NormalUser Moderator y

It looks like our most active users (who are not moderators) are somewhat younger than our normal users. Also, our moderators are all middle-aged.

In [56]:
plot(x = df[4], y = df[7]) # relationship between Age and MoneySpent
Out[56]:
x -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -100 0 100 200 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -2000 -1500 -1000 -500 0 500 1000 1500 2000 2500 3000 3500 -1500 -1450 -1400 -1350 -1300 -1250 -1200 -1150 -1100 -1050 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 2050 2100 2150 2200 2250 2300 2350 2400 2450 2500 2550 2600 2650 2700 2750 2800 2850 2900 2950 3000 -2000 0 2000 4000 -1500 -1400 -1300 -1200 -1100 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 y

Apparently, the high-spenders are in the middle aged group

In [57]:
plot(x = df[4], y = df[8]) # relationship between Age and FavoriteProduct
Out[57]:
x -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -100 0 100 200 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 VideoGame Book App None y

Not unexpectedly, our older users are more inclined to prefer books or nothing at all, when it comes to SW products.

In [58]:
plot(x = df[5], y = df[6]) # relationship between Gender and BlogRole
Out[58]:
x M F ActiveUser NormalUser Moderator y

This is definitely not the plot we are looking for (no interesting information whatsoever). Let's move along...

In [59]:
plot(x = df[5], y = df[7]) # relationship between Gender and MoneySpent
Out[59]:
x M F -2000 -1500 -1000 -500 0 500 1000 1500 2000 2500 3000 3500 -1500 -1450 -1400 -1350 -1300 -1250 -1200 -1150 -1100 -1050 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 2050 2100 2150 2200 2250 2300 2350 2400 2450 2500 2550 2600 2650 2700 2750 2800 2850 2900 2950 3000 -2000 0 2000 4000 -1500 -1400 -1300 -1200 -1100 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 y

It looks like our male users spend more on SW stuff, overall

In [60]:
plot(x = df[5], y = df[8]) # relationship between Gender and FavoriteProduct
Out[60]:
x M F VideoGame Book App None y

Not surprisingly, our female users are not so much into video games

In [61]:
plot(x = df[6], y = df[7]) # relationship between BlogRole and MoneySpent
Out[61]:
x ActiveUser NormalUser Moderator -2000 -1500 -1000 -500 0 500 1000 1500 2000 2500 3000 3500 -1500 -1450 -1400 -1350 -1300 -1250 -1200 -1150 -1100 -1050 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 2050 2100 2150 2200 2250 2300 2350 2400 2450 2500 2550 2600 2650 2700 2750 2800 2850 2900 2950 3000 -2000 0 2000 4000 -1500 -1400 -1300 -1200 -1100 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 y

Although active users and normal users can spend anything between 0 and $1000 on SW stuff, moderators always spend about the same and that's also a lot, compared to the other users.

In [62]:
plot(x = df[6], y = df[8]) # relationship between BlogRole and FavoriteProduct
Out[62]:
x ActiveUser NormalUser Moderator VideoGame Book App None y

It is clear that moderators are interested either in apps or books, but are definitely not apathetic nor into video games.

In [63]:
plot(x = df[7], y = df[8]) # relationship between MoneySpent and FavoriteProduct
Out[63]:
x -2000 -1500 -1000 -500 0 500 1000 1500 2000 2500 3000 3500 -1500 -1450 -1400 -1350 -1300 -1250 -1200 -1150 -1100 -1050 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 2050 2100 2150 2200 2250 2300 2350 2400 2450 2500 2550 2600 2650 2700 2750 2800 2850 2900 2950 3000 -2000 0 2000 4000 -1500 -1400 -1300 -1200 -1100 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 VideoGame Book App None y

Not surprisingly, those who are not into apps, or books, or video games, don't spend as much (perhaps they just go to conventions or something)

Data Representation (revisited)

Clearly, each one of the variables we have is not all that insightful, if used as-is (a quite typical phenomenon actually). Perhaps if we were to isolate the signals in them through partitioning the nominal ones...

Just from the data exploration part, we can easily draw a few actionable insights regarding their favorite product: if a user has watched 6 movies, this is a somewhat useful predictor (Favorite product is not None). The same applies if their favorite episode is VII, in which case the possibility of video games as a favorite product is scratched. Also, if the user is a she, their favorite product has to be something other than video games. Finally, if the user is a moderator, they are definitely into one of the 3 types of products. All these insights can be captured in various custom-made binary features such as :

  • IsMoviesWatched6 (takes the value true if the variable MoviesWatched is equal to 6, and false otherwise)
  • IsFemale
  • IsModerator
  • IsFavoriteEpisode7

(we could create additional binary features to represent all the possibilities but before doing so, it would be best to do some more digging in the data discovery stage, otherwise we run the risk of the feature set size exploding)

In [64]:
df[:IsMoviesWatched6] = (df[:MoviesWatched] .== 6)
df[:IsFavoriteEpisode7] = (df[:FavoriteEpisode] .== "VII")
df[:IsFemale] = (df[:Gender] .== "F")
df[:IsModerator] = (df[:BlogRole] .== "Moderator")
Out[64]:
100-element DataArrays.DataArray{Bool,1}:
 false
 false
 false
  true
 false
 false
  true
 false
 false
 false
 false
 false
 false
     ⋮
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
In [65]:
head(df)
Out[65]:
IDMoviesWatchedFavoriteEpisodeAgeGenderBlogRoleMoneySpentFavoriteProductIsMoviesWatched6IsFavoriteEpisode7IsFemaleIsModerator
116VI30MActiveUser1050.0VideoGametruefalsefalsefalse
223VI45MNormalUser1000.0Bookfalsefalsefalsefalse
334IV33FNormalUser650.0Bookfalsefalsetruefalse
443VI40MModerator700.0Bookfalsefalsefalsetrue
557VI25FNormalUser100.0Appfalsefalsetruefalse
667IV29MActiveUser250.0VideoGamefalsefalsefalsefalse

We can create an index array for these new features, along with the Age one, for easier access to them.

In [66]:
F = ["Age", "IsMoviesWatched6", "IsFemale", "IsModerator", "IsFavoriteEpisode7"]
F = map(symbol, F)
Out[66]:
5-element Array{Symbol,1}:
 :Age               
 :IsMoviesWatched6  
 :IsFemale          
 :IsModerator       
 :IsFavoriteEpisode7

Now it seems that we are ready for the next stage...

Data Modelling

Data Discovery

First, let's make use of a quite handy similarity metric (which is still not widely used), the symmetric similarity index. Just like all similarity metrics, it takes values between 0 and 1, the higher denoting more similarity. It is symetric in the sense that it captures negative similarities as well, something that the conventional similarities metrics fail to do. This allows you to see the actual similarity, just like the absolute value of the correlation coefficient for numeric features.

In [67]:
function SSI(x::BitArray{1}, y::BitArray{1})
    a = sum(!x & !y)
    b = sum(!x & y)
    c = sum(x & !y)
    d = sum(x & y)
    N = a + b + c + d
    x0 = (a + d) / N
    x1 = (b + c) / N
    return 2*max(x0, x1) - 1
end
Out[67]:
SSI (generic function with 1 method)

Although this function works for bit arrays only (for performance reasons), it's fairly easy to use on conventional binary arrays, by applying the BitArray() function to them.

In [68]:
SSI(BitArray(df[:IsFemale]), BitArray(df[:IsModerator]))
Out[68]:
0.6799999999999999

What about the similarity of a non-binary feature with a target? For that we can use the generalized version of SSI, which is designed to assess the predictive power of any nominal feature, for a non-binary target:

In [69]:
function SSI{T1 <: Any, T2 <: Any}(x::Array{T1, 1}, y::Array{T2, 1})
    N = length(x)
    Qx = sort(unique(x))
    Qy = sort(unique(y))
    cc = [length(Qx), length(Qy)]
    X = BitArray(N, cc[1])
    Y = BitArray(N, cc[2])
    S = Array(Float64, cc[1], cc[2])
    W = Array(Float64, cc[1], cc[2])

    for i = 1:cc[1]
        for j = 1:cc[2]
            xx = (x .== Qx[i])
            yy = (y .== Qy[j])
            W[i,j] = sum(xx & yy)
            S[i,j] = SSI(xx, yy)
        end
    end
    
    W /= N
    return sum(W.*S), S
end
Out[69]:
SSI (generic function with 2 methods)
In [70]:
SSI(convert(Array,df[:FavoriteEpisode]), convert(Array, df[:FavoriteProduct]))
Out[70]:
(0.26560000000000006,
5x4 Array{Float64,2}:
 0.12  0.28  0.72  0.68
 0.3   0.02  0.06  0.14
 0.12  0.08  0.6   0.36
 0.16  0.36  0.36  0.28
 0.16  0.32  0.72  0.52)

Clearly, this feature has potential, since it can predict the favorite product being video games (class 3) or if it's none (class 4) about 70% of the time, although it is a mediocre predictor for the other two classes.

In [71]:
SSI(convert(Array, df[:IsFavoriteEpisode7]), convert(Array, df[:FavoriteProduct]))
Out[71]:
(0.32439999999999997,
2x4 Array{Float64,2}:
 0.16  0.32  0.72  0.52
 0.16  0.32  0.72  0.52)
In [72]:
SSI(convert(Array, df[:MoviesWatched]), convert(Array, df[:FavoriteProduct]))
Out[72]:
(0.2188,
5x4 Array{Float64,2}:
 0.06  0.3   0.7   0.46
 0.14  0.14  0.42  0.34
 0.24  0.16  0.6   0.4 
 0.06  0.3   0.5   0.5 
 0.04  0.12  0.24  0.28)

Since several features tend to have some valuable aspects and others that are not that useful, it would make sense to binarize all the nominal features. Besides, this kind of representation would be useful for some classifiers.

In [73]:
function binarize{T <: Any}(X::Array{T, 1}, nv::Int64 = 0) # X is a nominal feature
    ux = sort(unique(X))
    if nv == 0; nv = length(ux); end
    Y = falses(length(X), nv) # feature values
    fn = Array(ASCIIString, nv) # feature names

    for i = 1:nv
        Y[(X .== ux[i]), i] = true
        fn[i] = string("Is_", ux[i])
    end

    return Y, fn
end

function binarize{T <: Any}(X::Array{T, 2}) # X is a set of nominal features
    N, n = size(X)
    if n == 1; return binarize(vec(X), 0); end # single feature case
    M = Array(Int64, n) # number of unique values for the various features in X
    UX = Array(Any, n) # Unique values of X

    for i = 1:n
        UX[i] = sort(unique(X[:,i]))
        M[i] = length(UX[i])
    end

    tnbf = sum(M) # Total Number of Binary Features
    Y = falses(N, tnbf) # feature values
    fn = Array(ASCIIString, tnbf) # feature names
    c = 0 # binary feature counter

    for i = 1:n # feature to be binarized
        for j = 1:M[i] # value of feature i to be processed
        c += 1 # binary feature
            Y[(X[:,i] .== UX[i][j]), c] = true
            fn[c] = string("Feat", i, "_", UX[i][j])
        end
    end

    return Y, fn
end
Out[73]:
binarize (generic function with 3 methods)
In [74]:
non_numeric_features = convert(Array,df[[:MoviesWatched, :FavoriteEpisode, :Gender, :BlogRole]])
Out[74]:
100x4 Array{Any,2}:
 6  "VI"   "M"  "ActiveUser"
 3  "VI"   "M"  "NormalUser"
 4  "IV"   "F"  "NormalUser"
 3  "VI"   "M"  "Moderator" 
 7  "VI"   "F"  "NormalUser"
 7  "IV"   "M"  "ActiveUser"
 4  "VI"   "M"  "Moderator" 
 6  "III"  "M"  "ActiveUser"
 7  "IV"   "M"  "ActiveUser"
 7  "IV"   "F"  "NormalUser"
 5  "V"    "M"  "NormalUser"
 7  "VII"  "M"  "ActiveUser"
 3  "IV"   "F"  "NormalUser"
 ⋮                          
 7  "VII"  "M"  "ActiveUser"
 4  "VI"   "M"  "ActiveUser"
 6  "IV"   "F"  "NormalUser"
 4  "VI"   "M"  "NormalUser"
 7  "VI"   "M"  "NormalUser"
 7  "VI"   "M"  "NormalUser"
 5  "V"    "M"  "NormalUser"
 3  "V"    "M"  "NormalUser"
 4  "IV"   "M"  "NormalUser"
 3  "VI"   "M"  "ActiveUser"
 6  "III"  "M"  "ActiveUser"
 6  "IV"   "M"  "ActiveUser"
In [75]:
bdf, feat_names = binarize(non_numeric_features);
# binarized nominal features
Out[75]:
(
100x15 BitArray{2}:
 false  false  false   true  false  …  false   true   true  false  false
  true  false  false  false  false     false   true  false  false   true
 false   true  false  false  false      true  false  false  false   true
  true  false  false  false  false     false   true  false   true  false
 false  false  false  false   true      true  false  false  false   true
 false  false  false  false   true  …  false   true   true  false  false
 false   true  false  false  false     false   true  false   true  false
 false  false  false   true  false     false   true   true  false  false
 false  false  false  false   true     false   true   true  false  false
 false  false  false  false   true      true  false  false  false   true
 false  false   true  false  false  …  false   true  false  false   true
 false  false  false  false   true     false   true   true  false  false
  true  false  false  false  false      true  false  false  false   true
     ⋮                              ⋱      ⋮                            
 false  false  false  false   true     false   true   true  false  false
 false   true  false  false  false     false   true   true  false  false
 false  false  false   true  false  …   true  false  false  false   true
 false   true  false  false  false     false   true  false  false   true
 false  false  false  false   true     false   true  false  false   true
 false  false  false  false   true     false   true  false  false   true
 false  false   true  false  false     false   true  false  false   true
  true  false  false  false  false  …  false   true  false  false   true
 false   true  false  false  false     false   true  false  false   true
  true  false  false  false  false     false   true   true  false  false
 false  false  false   true  false     false   true   true  false  false
 false  false  false   true  false     false   true   true  false  false,

ASCIIString["Feat1_3","Feat1_4","Feat1_5","Feat1_6","Feat1_7","Feat2_III","Feat2_IV","Feat2_V","Feat2_VI","Feat2_VII","Feat3_F","Feat3_M","Feat4_ActiveUser","Feat4_Moderator","Feat4_NormalUser"])
In [76]:
feat_names
Out[76]:
15-element Array{ASCIIString,1}:
 "Feat1_3"         
 "Feat1_4"         
 "Feat1_5"         
 "Feat1_6"         
 "Feat1_7"         
 "Feat2_III"       
 "Feat2_IV"        
 "Feat2_V"         
 "Feat2_VI"        
 "Feat2_VII"       
 "Feat3_F"         
 "Feat3_M"         
 "Feat4_ActiveUser"
 "Feat4_Moderator" 
 "Feat4_NormalUser"
In [77]:
temp = convert(Array, df[:FavoriteProduct])
bfp, product_name = binarize(temp) # binarized favorite product
Out[77]:
(
100x4 BitArray{2}:
 false  false  false   true
 false   true  false  false
 false   true  false  false
 false   true  false  false
  true  false  false  false
 false  false  false   true
 false   true  false  false
 false  false  false   true
  true  false  false  false
 false   true  false  false
 false  false   true  false
  true  false  false  false
  true  false  false  false
     ⋮                     
 false  false   true  false
 false   true  false  false
  true  false  false  false
 false  false  false   true
  true  false  false  false
 false   true  false  false
 false   true  false  false
 false  false   true  false
 false   true  false  false
 false   true  false  false
 false  false  false   true
 false   true  false  false,

ASCIIString["Is_App","Is_Book","Is_None","Is_VideoGame"])
In [78]:
N, nbf = size(bdf) # nbf = number of binary features
ndf = DataFrame() # all numeric data frame

for i = 1:nbf
    ndf[symbol(feat_names[i])] = DataArray(float(bdf[:,i]))
end

ndf[:Age] = df[:Age]
ndf[:MoneySpent] = df[:MoneySpent]

for i = 1:4
    ndf[symbol(product_name[i])] = DataArray(float(bfp[:,i]))
end
In [79]:
head(ndf)
Out[79]:
Feat1_3Feat1_4Feat1_5Feat1_6Feat1_7Feat2_IIIFeat2_IVFeat2_VFeat2_VIFeat2_VIIFeat3_FFeat3_MFeat4_ActiveUserFeat4_ModeratorFeat4_NormalUserAgeMoneySpentIs_AppIs_BookIs_NoneIs_VideoGame
10.00.00.01.00.00.00.00.01.00.00.01.01.00.00.0301050.00.00.00.01.0
21.00.00.00.00.00.00.00.01.00.00.01.00.00.01.0451000.00.01.00.00.0
30.01.00.00.00.00.01.00.00.00.01.00.00.00.01.033650.00.01.00.00.0
41.00.00.00.00.00.00.00.01.00.00.01.00.01.00.040700.00.01.00.00.0
50.00.00.00.01.00.00.00.01.00.01.00.00.00.01.025100.01.00.00.00.0
60.00.00.00.01.00.01.00.00.00.00.01.01.00.00.029250.00.00.00.01.0

We can run a few more tests on the dataset and see if we can come up with other insteresting variables (e.g. combos of the ones we have already), but we have a deadline to meet! So, let's save our work so far and carry on to the next stage of our analysis. To ensure that even non-Julians can access this data, we save everything in .csv format.

In [80]:
writetable("CaffeineForTheForce_processed.csv", df)
In [81]:
writetable("CaffeineForTheForce_all_numeric.csv", ndf)
In [82]:
ndf = readtable("CaffeineForTheForce_all_numeric.csv") # we can skip this command if we are running the script from beginning to end. It is very useful however to have it in case we need to redo, or check the data modeling part, without having to go through the data engineering bit every time.
Out[82]:
Feat1_3Feat1_4Feat1_5Feat1_6Feat1_7Feat2_IIIFeat2_IVFeat2_VFeat2_VIFeat2_VIIFeat3_FFeat3_MFeat4_ActiveUserFeat4_ModeratorFeat4_NormalUserAgeMoneySpentIs_AppIs_BookIs_NoneIs_VideoGame
10.00.00.01.00.00.00.00.01.00.00.01.01.00.00.0301050.00.00.00.01.0
21.00.00.00.00.00.00.00.01.00.00.01.00.00.01.0451000.00.01.00.00.0
30.01.00.00.00.00.01.00.00.00.01.00.00.00.01.033650.00.01.00.00.0
41.00.00.00.00.00.00.00.01.00.00.01.00.01.00.040700.00.01.00.00.0
50.00.00.00.01.00.00.00.01.00.01.00.00.00.01.025100.01.00.00.00.0
60.00.00.00.01.00.01.00.00.00.00.01.01.00.00.029250.00.00.00.01.0
70.01.00.00.00.00.00.00.01.00.00.01.00.01.00.037700.00.01.00.00.0
80.00.00.01.00.01.00.00.00.00.00.01.01.00.00.020350.00.00.00.01.0
90.00.00.00.01.00.01.00.00.00.00.01.01.00.00.027450.01.00.00.00.0
100.00.00.00.01.00.01.00.00.00.01.00.00.00.01.030350.00.01.00.00.0
110.00.01.00.00.00.00.01.00.00.00.01.00.00.01.03350.00.00.01.00.0
120.00.00.00.01.00.00.00.00.01.00.01.01.00.00.032300.01.00.00.00.0
131.00.00.00.00.00.01.00.00.00.01.00.00.00.01.039300.01.00.00.00.0
140.00.01.00.00.00.00.00.00.01.00.01.01.00.00.042400.01.00.00.00.0
150.01.00.00.00.00.00.01.00.00.00.01.01.00.00.045100.00.00.01.00.0
160.00.00.00.01.00.00.00.01.00.01.00.00.00.01.048250.00.01.00.00.0
170.00.00.00.01.00.01.00.00.00.00.01.00.00.01.051400.01.00.00.00.0
180.00.00.00.01.00.00.00.01.00.00.01.00.00.01.054550.00.01.00.00.0
190.00.00.01.00.00.00.01.00.00.00.01.00.00.01.057450.00.01.00.00.0
200.01.00.00.00.00.00.00.01.00.00.01.01.00.00.060100.00.00.01.00.0
210.00.01.00.00.00.01.00.00.00.00.01.00.00.01.040150.00.01.00.00.0
221.00.00.00.00.00.00.01.00.00.00.01.00.00.01.06650.00.00.01.00.0
230.01.00.00.00.00.01.00.00.00.00.01.01.00.00.034250.00.00.00.01.0
240.00.00.01.00.00.00.00.01.00.00.01.00.00.01.045300.00.01.00.00.0
250.01.00.00.00.00.00.01.00.00.00.01.01.00.00.033350.01.00.00.00.0
260.00.00.00.01.00.00.00.01.00.01.00.01.00.00.04050.00.00.01.00.0
270.00.00.00.01.00.01.00.00.00.00.01.01.00.00.026200.00.00.00.01.0
280.00.00.00.01.00.00.00.00.01.00.01.01.00.00.029250.01.00.00.00.0
290.00.01.00.00.00.01.00.00.00.00.01.01.00.00.037500.00.00.00.01.0
300.00.00.01.00.00.01.00.00.00.00.01.01.00.00.021250.01.00.00.00.0
In [83]:
N, n = size(ndf)
Out[83]:
(100,21)
In [84]:
n -= 5 # adjust the number of features by subtracting the number of target variables
Out[84]:
16

Data Learning

Let's start by loading the necessary packages (we could use additional ML tools, but these should be enough for the problem at hand).

In [85]:
using DecisionTree # a package with some commonly used classifiers
using BackpropNeuralNet # a package for a basic neural network
using ELM # a package for Extreme Learning Machines
using GLM # a package for Generalized Linear Models (regression)
using ROC # a great validation package for classification problems
INFO: Recompiling stale cache file C:\Users\Zacharias\.julia\lib\v0.4\ScikitLearnBase.ji for module ScikitLearnBase.
WARNING: Union(args...) is deprecated, use Union{args...} instead.
 in depwarn at deprecated.jl:73
 in call at deprecated.jl:50
 in include at boot.jl:261
 in include_from_node1 at loading.jl:320
 in include at boot.jl:261
 in include_from_node1 at loading.jl:320
 in require at loading.jl:259
 in include_string at loading.jl:282
 in execute_request_0x535c5df2 at C:\Users\Zacharias\.julia\v0.4\IJulia\src\execute_request.jl:183
 in eventloop at C:\Users\Zacharias\.julia\v0.4\IJulia\src\IJulia.jl:143
 in anonymous at task.jl:447
while loading C:\Users\Zacharias\.julia\v0.4\ELM\src\base.jl, in expression starting on line 77
WARNING: Union(args...) is deprecated, use Union{args...} instead.
 in depwarn at deprecated.jl:73
 in call at deprecated.jl:50
 in include at boot.jl:261
 in include_from_node1 at loading.jl:320
 in include at boot.jl:261
 in include_from_node1 at loading.jl:320
 in require at loading.jl:259
 in include_string at loading.jl:282
 in execute_request_0x535c5df2 at C:\Users\Zacharias\.julia\v0.4\IJulia\src\execute_request.jl:183
 in eventloop at C:\Users\Zacharias\.julia\v0.4\IJulia\src\IJulia.jl:143
 in anonymous at task.jl:447
while loading C:\Users\Zacharias\.julia\v0.4\ELM\src\base.jl, in expression starting on line 77
WARNING: Union(args...) is deprecated, use Union{args...} instead.
 in depwarn at deprecated.jl:73
 in call at deprecated.jl:50
 in include at boot.jl:261
 in include_from_node1 at loading.jl:320
 in include at boot.jl:261
 in include_from_node1 at loading.jl:320
 in require at loading.jl:259
 in include_string at loading.jl:282
 in execute_request_0x535c5df2 at C:\Users\Zacharias\.julia\v0.4\IJulia\src\execute_request.jl:183
 in eventloop at C:\Users\Zacharias\.julia\v0.4\IJulia\src\IJulia.jl:143
 in anonymous at task.jl:447
while loading C:\Users\Zacharias\.julia\v0.4\ELM\src\base.jl, in expression starting on line 111
WARNING: using GLM.df in module Main conflicts with an existing identifier.
WARNING: requiring "Dates" in module "Winston" did not define a corresponding module.
WARNING: module Winston should explicitly import * from Base
WARNING: using ROC.plot in module Main conflicts with an existing identifier.

Classification

Here we'll see how we can get the computer to classify a user based on the features we have, onto the four categories of products they may be interested in.

First, let's prepare the data for the classifiers / regressors. We are going to use a 70%-30% split, which is typical for this kind of applications (although this could probably be done using a package, it's so simple that it would probably not save us any effort).

In [139]:
r = randperm(N);
tr = round(Int64, N*0.7) # number of training data points
te = N - tr # number of testing data points
Out[139]:
30
In [107]:
train_data = df[r[1:tr],:]; # training set for our data learning experiments
test_data = df[r[(tr+1):end],:]; # testing set for our data learning experiments
Out[107]:
IDMoviesWatchedFavoriteEpisodeAgeGenderBlogRoleMoneySpentFavoriteProductIsMoviesWatched6IsFavoriteEpisode7IsFemaleIsModerator
1455IV22FActiveUser300.0Appfalsefalsetruefalse
2937VI30MNormalUser200.0Appfalsefalsefalsefalse
3115V33MNormalUser50.0Nonefalsefalsefalsefalse
4627V30MNormalUser150.0Bookfalsefalsefalsefalse
5633IV35MActiveUser250.0Appfalsefalsefalsefalse
6246VI45MNormalUser300.0Booktruefalsefalsefalse
774VI37MModerator700.0Bookfalsefalsefalsetrue
8974IV52MNormalUser100.0Bookfalsefalsefalsefalse
9215IV40MNormalUser150.0Bookfalsefalsefalsefalse
10534IV32MActiveUser200.0Appfalsefalsefalsefalse
11543IV30MActiveUser200.0VideoGamefalsefalsefalsefalse
12904VI40MActiveUser250.0Bookfalsefalsefalsefalse
13715V27MActiveUser200.0Appfalsefalsefalsefalse
14463IV18MActiveUser150.0Bookfalsefalsefalsefalse
15924VI29MNormalUser250.0VideoGamefalsefalsefalsefalse
16597VI40MActiveUser350.0Appfalsefalsefalsefalse
17734VI33MActiveUser150.0Bookfalsefalsefalsefalse
18364IV34MActiveUser350.0Appfalsefalsefalsefalse
19567V34MActiveUser250.0VideoGamefalsefalsefalsefalse
20844VI32MActiveUser300.0Appfalsefalsefalsefalse
21437IV48MActiveUser450.0Appfalsefalsefalsefalse
22865V38MActiveUser250.0Appfalsefalsefalsefalse
23204VI60MActiveUser100.0Nonefalsefalsefalsefalse
24745IV36MActiveUser300.0Appfalsefalsefalsefalse
2586III20MActiveUser350.0VideoGametruefalsefalsefalse
26374IV36MActiveUser350.0Appfalsefalsefalsefalse
27107IV30FNormalUser350.0Bookfalsefalsetruefalse
28427IV40MActiveUser250.0Bookfalsefalsefalsefalse
29754IV32MActiveUser350.0Appfalsefalsefalsefalse
3043VI40MModerator700.0Bookfalsefalsefalsetrue

Now it would be a good time to see if we can get the computer to learn any patterns to be able to predict the FavoriteProduct variable for various users. Let's start with one of the simplest classifiers (which is also quite popular for problems of low dimensionality like this one).

In [108]:
ind = [2, 3, 4, 5, 6, 9, 10, 11]
features_training = convert(Array,(train_data[:, ind]));
labels_training = convert(Array,train_data[:, 8]);
features_testing = convert(Array,(test_data[:, ind]));
labels_testing = convert(Array,test_data[:, 8]);
In [109]:
model1 = build_tree(labels_training, features_training); # create the decision tree
Out[109]:
Decision Tree
Leaves: 35
Depth:  12
In [110]:
model1 = prune_tree(model1, 0.9); # cut down a few nodes to make it less prone to overfitting
Out[110]:
Decision Tree
Leaves: 35
Depth:  12
In [111]:
pred = apply_tree(model1, features_testing)
Out[111]:
30-element Array{Any,1}:
 "App"      
 "Book"     
 "App"      
 "Book"     
 "App"      
 "Book"     
 "App"      
 "App"      
 "Book"     
 "App"      
 "App"      
 "Book"     
 "App"      
 ⋮          
 "VideoGame"
 "App"      
 "VideoGame"
 "VideoGame"
 "None"     
 "VideoGame"
 "App"      
 "VideoGame"
 "Book"     
 "App"      
 "App"      
 "Book"     
In [112]:
AR = sum(labels_testing .== pred) / length(labels_testing)
Out[112]:
0.4666666666666667

This doesn't look very promising. Let's try another model along these lines...

In [113]:
model2 = build_forest(labels_training, features_training, 3, 20); # params = number of features in each tree, number of trees
Out[113]:
Ensemble of Decision Trees
Trees:      20
Avg Leaves: 18.3
Avg Depth:  7.55
In [114]:
pred = apply_forest(model2, features_testing)
Out[114]:
30-element Array{Any,1}:
 "App"      
 "Book"     
 "Book"     
 "Book"     
 "VideoGame"
 "Book"     
 "VideoGame"
 "Book"     
 "Book"     
 "App"      
 "VideoGame"
 "Book"     
 "App"      
 ⋮          
 "VideoGame"
 "VideoGame"
 "VideoGame"
 "App"      
 "None"     
 "VideoGame"
 "App"      
 "VideoGame"
 "Book"     
 "VideoGame"
 "App"      
 "Book"     
In [115]:
AR = sum(labels_testing .== pred) / length(labels_testing)
Out[115]:
0.5

That's better. Apparently a single tree can't create a good enough generalization but a series of them can do the trick. Let's see how a different kind of tree ensemble performs.

In [116]:
model3, coeffs = build_adaboost_stumps(labels_training, features_training, 10);
In [117]:
pred = apply_adaboost_stumps(model3, coeffs, features_testing)
Out[117]:
30-element Array{Any,1}:
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 ⋮    
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
 "App"
In [118]:
AR = sum(labels_testing .== pred) / length(labels_testing)
Out[118]:
0.43333333333333335

Apparently, this doesn't help our case much. Maybe we'll have better luck with a different type of classifier, such as a neural network.

Before doing so though, we'll need to get the all-numeric version of the dataset, since neural nets are incompatible with anything else.

In [119]:
train_data = ndf[r[1:tr],:]; # training set for our data learning experiments using all-numeric features
test_data = ndf[r[(tr+1):end],:]; # testing set for our data learning experiments using all-numeric features
Out[119]:
Feat1_3Feat1_4Feat1_5Feat1_6Feat1_7Feat2_IIIFeat2_IVFeat2_VFeat2_VIFeat2_VIIFeat3_FFeat3_MFeat4_ActiveUserFeat4_ModeratorFeat4_NormalUserAgeMoneySpentIs_AppIs_BookIs_NoneIs_VideoGame
10.00.01.00.00.00.01.00.00.00.01.00.01.00.00.022300.01.00.00.00.0
20.00.00.00.01.00.00.00.01.00.00.01.00.00.01.030200.01.00.00.00.0
30.00.01.00.00.00.00.01.00.00.00.01.00.00.01.03350.00.00.01.00.0
40.00.00.00.01.00.00.01.00.00.00.01.00.00.01.030150.00.01.00.00.0
51.00.00.00.00.00.01.00.00.00.00.01.01.00.00.035250.01.00.00.00.0
60.00.00.01.00.00.00.00.01.00.00.01.00.00.01.045300.00.01.00.00.0
70.01.00.00.00.00.00.00.01.00.00.01.00.01.00.037700.00.01.00.00.0
80.01.00.00.00.00.01.00.00.00.00.01.00.00.01.052100.00.01.00.00.0
90.00.01.00.00.00.01.00.00.00.00.01.00.00.01.040150.00.01.00.00.0
100.01.00.00.00.00.01.00.00.00.00.01.01.00.00.032200.01.00.00.00.0
111.00.00.00.00.00.01.00.00.00.00.01.01.00.00.030200.00.00.00.01.0
120.01.00.00.00.00.00.00.01.00.00.01.01.00.00.040250.00.01.00.00.0
130.00.01.00.00.00.00.01.00.00.00.01.01.00.00.027200.01.00.00.00.0
141.00.00.00.00.00.01.00.00.00.00.01.01.00.00.018150.00.01.00.00.0
150.01.00.00.00.00.00.00.01.00.00.01.00.00.01.029250.00.00.00.01.0
160.00.00.00.01.00.00.00.01.00.00.01.01.00.00.040350.01.00.00.00.0
170.01.00.00.00.00.00.00.01.00.00.01.01.00.00.033150.00.01.00.00.0
180.01.00.00.00.00.01.00.00.00.00.01.01.00.00.034350.01.00.00.00.0
190.00.00.00.01.00.00.01.00.00.00.01.01.00.00.034250.00.00.00.01.0
200.01.00.00.00.00.00.00.01.00.00.01.01.00.00.032300.01.00.00.00.0
210.00.00.00.01.00.01.00.00.00.00.01.01.00.00.048450.01.00.00.00.0
220.00.01.00.00.00.00.01.00.00.00.01.01.00.00.038250.01.00.00.00.0
230.01.00.00.00.00.00.00.01.00.00.01.01.00.00.060100.00.00.01.00.0
240.00.01.00.00.00.01.00.00.00.00.01.01.00.00.036300.01.00.00.00.0
250.00.00.01.00.01.00.00.00.00.00.01.01.00.00.020350.00.00.00.01.0
260.01.00.00.00.00.01.00.00.00.00.01.01.00.00.036350.01.00.00.00.0
270.00.00.00.01.00.01.00.00.00.01.00.00.00.01.030350.00.01.00.00.0
280.00.00.00.01.00.01.00.00.00.00.01.01.00.00.040250.00.01.00.00.0
290.01.00.00.00.00.01.00.00.00.00.01.01.00.00.032350.01.00.00.00.0
301.00.00.00.00.00.00.00.01.00.00.01.00.01.00.040700.00.01.00.00.0
In [120]:
n = size(train_data, 2) - 5 # you can skip this command if you have already computed the number of variables n
Out[120]:
16
In [121]:
features_training_new = map(Float64,Array(train_data[:,1:n]))
features_testing_new = map(Float64,Array(test_data[:,1:n]))
Out[121]:
30x16 Array{Float64,2}:
 0.0  0.0  1.0  0.0  0.0  0.0  1.0  …  0.0  1.0  0.0  1.0  0.0  0.0  22.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  1.0  0.0  0.0  1.0  30.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  0.0  1.0  33.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  1.0  0.0  0.0  1.0  30.0
 1.0  0.0  0.0  0.0  0.0  0.0  1.0     0.0  0.0  1.0  1.0  0.0  0.0  35.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  …  0.0  0.0  1.0  0.0  0.0  1.0  45.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  1.0  0.0  37.0
 0.0  1.0  0.0  0.0  0.0  0.0  1.0     0.0  0.0  1.0  0.0  0.0  1.0  52.0
 0.0  0.0  1.0  0.0  0.0  0.0  1.0     0.0  0.0  1.0  0.0  0.0  1.0  40.0
 0.0  1.0  0.0  0.0  0.0  0.0  1.0     0.0  0.0  1.0  1.0  0.0  0.0  32.0
 1.0  0.0  0.0  0.0  0.0  0.0  1.0  …  0.0  0.0  1.0  1.0  0.0  0.0  30.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  1.0  0.0  0.0  40.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  1.0  0.0  0.0  27.0
 ⋮                        ⋮         ⋱       ⋮                         ⋮  
 0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  1.0  1.0  0.0  0.0  34.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  1.0  0.0  0.0  32.0
 0.0  0.0  0.0  0.0  1.0  0.0  1.0  …  0.0  0.0  1.0  1.0  0.0  0.0  48.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  1.0  0.0  0.0  38.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  1.0  0.0  0.0  60.0
 0.0  0.0  1.0  0.0  0.0  0.0  1.0     0.0  0.0  1.0  1.0  0.0  0.0  36.0
 0.0  0.0  0.0  1.0  0.0  1.0  0.0     0.0  0.0  1.0  1.0  0.0  0.0  20.0
 0.0  1.0  0.0  0.0  0.0  0.0  1.0  …  0.0  0.0  1.0  1.0  0.0  0.0  36.0
 0.0  0.0  0.0  0.0  1.0  0.0  1.0     0.0  1.0  0.0  0.0  0.0  1.0  30.0
 0.0  0.0  0.0  0.0  1.0  0.0  1.0     0.0  0.0  1.0  1.0  0.0  0.0  40.0
 0.0  1.0  0.0  0.0  0.0  0.0  1.0     0.0  0.0  1.0  1.0  0.0  0.0  32.0
 1.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  1.0  0.0  40.0

It would be a good idea to normalize the last feature (Age), since, unlike Jedis, neural networks can get confused very easily with scale.

In [122]:
max_ = maximum(features_training_new[:,end])
min_ = minimum(features_training_new[:,end])
Out[122]:
12.0
In [123]:
features_training_new[:,end] = (features_training_new[:,end] - min_) / (max_ - min_);
features_testing_new[:,end] = (features_testing_new[:,end] - min_) / (max_ - min_);
In [124]:
labels_training_new = map(Float64,Array(train_data[:,(end-3):end]))
labels_testing_new = map(Float64,Array(test_data[:,(end-3):end]))
Out[124]:
30x4 Array{Float64,2}:
 1.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0
 0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0
 ⋮                 
 0.0  0.0  0.0  1.0
 1.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
In [125]:
println(labels_training_new[1:10,:])
println(labels_testing_new[1:10,:])
[1.0 0.0 0.0 0.0
 0.0 0.0 0.0 1.0
 0.0 1.0 0.0 0.0
 0.0 0.0 0.0 1.0
 1.0 0.0 0.0 0.0
 0.0 1.0 0.0 0.0
 1.0 0.0 0.0 0.0
 1.0 0.0 0.0 0.0
 1.0 0.0 0.0 0.0
 0.0 0.0 0.0 1.0]
[1.0 0.0 0.0 0.0
 1.0 0.0 0.0 0.0
 0.0 0.0 1.0 0.0
 0.0 1.0 0.0 0.0
 1.0 0.0 0.0 0.0
 0.0 1.0 0.0 0.0
 0.0 1.0 0.0 0.0
 0.0 1.0 0.0 0.0
 0.0 1.0 0.0 0.0
 1.0 0.0 0.0 0.0]
In [126]:
net = init_network([n, round(Int64, 1.5*n), m]); # create an artificial neural net
Out[126]:
BackpropNeuralNet.NeuralNetwork([16,24,4],false,0.25,0.1,(anonymous function),(anonymous function),(anonymous function),Array{Float64,N}[[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0  …  1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],[1.0,1.0,1.0,1.0]],Array{Float64,N}[
17x24 Array{Float64,2}:
 -0.716  -0.931   0.395  -0.879  -0.515  …  -0.663   0.587  -0.692  -0.463
 -0.321  -0.758  -0.425  -0.286   1.0       -0.37   -0.2     0.31    0.171
  0.673  -0.694  -0.942  -0.586   0.779     -0.864  -0.628   0.192   0.494
 -0.356   0.177   0.922   0.599   0.101      0.822  -0.385   0.526  -0.804
  0.041  -0.132   0.792  -0.448   0.232      0.176  -0.938  -0.261  -0.341
 -0.186   0.0    -0.426  -0.72   -0.999  …   0.72    0.029   0.899   0.437
  0.685   0.079   0.268   0.15    0.985     -0.328  -0.165  -0.965  -0.003
 -0.479   0.869   0.943  -0.431  -0.375      0.97   -0.408  -0.489  -0.686
  0.343  -0.153  -0.537   0.046  -0.287     -0.986   0.488  -0.229   0.846
 -0.377  -0.175  -0.001  -0.614   0.283      0.081  -0.089   0.434   0.444
 -0.978  -0.688  -0.245  -0.841   0.259  …  -0.848  -0.776  -0.671   0.526
  0.773  -0.674  -0.9    -0.541  -0.306      0.893   0.34   -0.5    -0.12 
  0.398   0.322   0.68   -0.286  -0.612      0.5     0.928   0.379  -0.199
  0.11    0.664  -0.519  -0.317   0.649     -0.351   0.96   -0.77    0.798
  0.125  -0.934  -0.699   0.876  -0.696     -0.126   0.185  -0.941  -0.641
  0.435   0.277   0.199   0.295  -0.579  …   0.754  -0.706  -0.158  -0.514
  0.733  -0.696  -0.665  -0.269  -0.061     -0.32    0.173  -0.728   0.441,

25x4 Array{Float64,2}:
 -0.941  -0.978   0.353  -0.346
 -0.849   0.009   0.54   -0.241
 -0.904  -0.016  -0.588  -0.573
 -0.049   0.819  -0.383   0.882
 -0.546  -0.282  -0.659  -0.353
 -0.464  -0.469   0.746   0.444
  0.776  -0.806   0.855   0.693
  0.887   0.727  -0.16    0.647
 -0.085  -0.499   0.865   0.841
 -0.219  -0.519   0.001  -0.533
 -0.412  -0.316   0.538   0.531
  0.112   0.538   0.32    0.708
  0.382   0.834  -0.346  -0.972
 -0.353  -0.446   0.371  -0.051
 -0.014  -0.405   0.437   0.565
  0.635  -0.765   0.077   0.964
  0.114  -0.549   0.494  -0.126
  0.937   0.198   0.659   0.37 
 -0.377   0.394  -0.185  -0.201
  0.115   0.211  -0.932   0.563
  0.628  -0.314   0.095   0.016
 -0.032  -0.112  -0.304  -0.282
 -0.554  -0.481  -0.138  -0.478
  0.72   -0.368   0.854  -0.866
  0.712  -0.85    0.812   0.054],Array{Float64,N}[
17x24 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0,

25x4 Array{Float64,2}:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0],Array{Float64,N}[#undef])
In [127]:
for j = 1:2500 # train for 2500 epochs
    if mod(j,500) == 0; println(j); end # this is just to show how the training progresses
    
    for i = 1:size(features_training_new, 1)
        a = features_training_new[i,:]
        b = labels_training_new[i,:]
        train(net, a[:], b[:])
    end
end
500
1000
1500
2000
2500
In [128]:
nn = size(features_testing_new, 1)
pred_labels = Array(Float64, nn, m)

for i = 1:nn
    pred_labels[i,:] = net_eval(net, features_testing_new[i,:][:])
end
In [129]:
function ANN2vector{T <: Any}(Z::Array{Float64, 2}, Q::Array{T,1})
    # turn a matrix one from an ANN into a vector output
    N, noc = size(Z)
    z = Array(T, N)
    p = Array(Float64, N)
    
    for i = 1:N
        p[i], ind = findmax(Z[i,:])
        z[i] = Q[ind]
    end
    
    return z, p
end
Out[129]:
ANN2vector (generic function with 1 method)
In [130]:
QQ = sort(Array(Q))
Out[130]:
4-element Array{AbstractString,1}:
 "App"      
 "Book"     
 "None"     
 "VideoGame"
In [131]:
pred, prob = ANN2vector(pred_labels, QQ)
Out[131]:
(AbstractString["App","Book","Book","Book","App","Book","Book","Book","Book","VideoGame"  …  "App","App","Book","VideoGame","VideoGame","App","App","App","VideoGame","Book"],[0.951835,0.997838,0.459711,0.610938,0.999401,1.0,0.67811,0.998277,0.998538,0.908261  …  0.977297,0.999658,0.998663,0.969611,0.999105,0.350335,0.859202,0.907114,0.908261,0.999962])
In [132]:
AR = sum(labels_testing .== pred) / length(labels_testing)
Out[132]:
0.43333333333333335

Hmm. Perhaps there is a limit to how good performance we can get with this dataset. Let's see if ELMs can give us a better prediction.

In [140]:
targets_training = ones(tr)
targets_testing = ones(te)
Out[140]:
30-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮  
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
In [142]:
ind_book = (labels_training .== "Book")
ind_none = (labels_training .== "None")
ind_game = (labels_training .== "VideoGame")
Out[142]:
70-element BitArray{1}:
 false
  true
 false
  true
 false
 false
 false
 false
 false
  true
 false
  true
 false
     ⋮
 false
 false
 false
  true
  true
 false
  true
 false
 false
 false
 false
 false
In [143]:
targets_training[ind_book] = 2.0
targets_training[ind_none] = 3.0
targets_training[ind_game] = 4.0
Out[143]:
4.0
In [144]:
ind_book = (labels_testing .== "Book")
ind_none = (labels_testing .== "None")
ind_game = (labels_testing .== "VideoGame")
Out[144]:
30-element BitArray{1}:
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
  true
 false
 false
     ⋮
  true
 false
 false
 false
 false
 false
  true
 false
 false
 false
 false
 false
In [145]:
targets_testing[ind_book] = 2.0
targets_testing[ind_none] = 3.0
targets_testing[ind_game] = 4.0
Out[145]:
4.0
In [146]:
elm = ExtremeLearningMachine(35);
In [147]:
ELM.fit!(elm, features_training_new, targets_training)
Out[147]:
1x35 Array{Float64,2}:
 -3.11041  -26.2388  4.72645  -10.6715  …  -11.7578  -8.0015  -20.4951
In [148]:
outputs = ELM.predict(elm, features_testing_new)
Out[148]:
30-element Array{Float64,1}:
  0.127967 
  2.72315  
  1.64368  
  2.51131  
  1.70565  
  1.88713  
 -0.758001 
  0.0332703
  3.56331  
  2.00462  
  1.44687  
  3.99414  
  0.674724 
  ⋮        
  2.42811  
  3.70569  
  2.58183  
  1.36484  
  4.42129  
  3.1962   
  3.94834  
  2.28888  
  1.54135  
  2.50779  
  2.00462  
 -5.98197  

Clearly, we'll need to adjust the outputs so that they can be mapped to the 4 classes we have.

In [149]:
outputs[outputs .< 1.0] = 1.0
Out[149]:
1.0
In [150]:
outputs[outputs .> 4.0] = 4.0
Out[150]:
4.0
In [151]:
pred = QQ[round(Int64, outputs)]
Out[151]:
30-element Array{AbstractString,1}:
 "App"      
 "None"     
 "Book"     
 "None"     
 "Book"     
 "Book"     
 "App"      
 "App"      
 "VideoGame"
 "Book"     
 "App"      
 "VideoGame"
 "App"      
 ⋮          
 "Book"     
 "VideoGame"
 "None"     
 "App"      
 "VideoGame"
 "None"     
 "VideoGame"
 "Book"     
 "Book"     
 "None"     
 "Book"     
 "App"      
In [152]:
AR = sum(labels_testing .== pred) / length(labels_testing)
Out[152]:
0.23333333333333334

Time to evaluate the results in more depth. In this part we'll be using the ROC curve, which is a very robust validation strategy. Since only some of the classifiers performed adequately, we'll focus on just these ones (random forest and neural network).

In [153]:
prob = apply_forest_proba(model2, features_testing, QQ)
p = maximum(prob, 2)
Out[153]:
30x1 Array{Float64,2}:
 0.75
 0.75
 0.4 
 0.75
 0.3 
 1.0 
 0.5 
 0.5 
 0.5 
 0.55
 0.3 
 0.45
 0.7 
 ⋮   
 0.6 
 0.55
 0.35
 0.55
 0.45
 0.65
 0.5 
 0.5 
 0.75
 0.35
 0.55
 0.5 
In [154]:
p = p[:]; # make probabilities Array 1-dimensional
Out[154]:
30-element Array{Float64,1}:
 0.75
 0.75
 0.4 
 0.75
 0.3 
 1.0 
 0.5 
 0.5 
 0.5 
 0.55
 0.3 
 0.45
 0.7 
 ⋮   
 0.6 
 0.55
 0.35
 0.55
 0.45
 0.65
 0.5 
 0.5 
 0.75
 0.35
 0.55
 0.5 
In [155]:
z = (apply_forest(model2, features_testing) .== labels_testing)
Out[155]:
30-element BitArray{1}:
  true
 false
 false
  true
 false
  true
 false
  true
  true
  true
  true
  true
  true
     ⋮
  true
 false
 false
  true
  true
 false
 false
 false
  true
 false
  true
  true
In [156]:
rc1 = ROC.roc(p, z)
Out[156]:
ROC.ROCData{Float64}([1.0,0.75,0.75,0.75,0.75,0.7,0.65,0.6,0.55,0.55  …  0.5,0.5,0.45,0.45,0.45,0.4,0.35,0.35,0.3,0.3],Bool[true,true,false,true,true,true,false,true,true,false  …  false,true,true,false,true,false,false,false,false,true],15,30,15,1:29,[1,2,2,3,4,5,5,6,7,7  …  11,11,12,13,13,14,14,14,14,14],[15,15,14,14,14,14,13,13,13,12  …  6,5,5,5,4,4,3,2,1,0],[0,0,1,1,1,1,2,2,2,3  …  9,10,10,10,11,11,12,13,14,15],[14,13,13,12,11,10,10,9,8,8  …  4,4,3,2,2,1,1,1,1,1],[0.0,0.0,0.0666667,0.0666667,0.0666667,0.0666667,0.133333,0.133333,0.133333,0.2  …  0.6,0.666667,0.666667,0.666667,0.733333,0.733333,0.8,0.866667,0.933333,1.0],[0.0666667,0.133333,0.133333,0.2,0.266667,0.333333,0.333333,0.4,0.466667,0.466667  …  0.733333,0.733333,0.8,0.866667,0.866667,0.933333,0.933333,0.933333,0.933333,0.933333])
In [237]:
ROC.plot(rc1)
Out[237]:
In [157]:
AUC(rc1)
Out[157]:
0.6844444444444443
In [158]:
pred, prob = ANN2vector(pred_labels, QQ)
Out[158]:
(AbstractString["App","Book","Book","Book","App","Book","Book","Book","Book","VideoGame"  …  "App","App","Book","VideoGame","VideoGame","App","App","App","VideoGame","Book"],[0.951835,0.997838,0.459711,0.610938,0.999401,1.0,0.67811,0.998277,0.998538,0.908261  …  0.977297,0.999658,0.998663,0.969611,0.999105,0.350335,0.859202,0.907114,0.908261,0.999962])
In [159]:
z = (pred .== labels_testing);
In [160]:
rc2 = ROC.roc(prob, z)
Out[160]:
ROC.ROCData{Float64}([1.0,0.999962,0.999664,0.999658,0.999401,0.999327,0.999323,0.999283,0.99924,0.999201  …  0.908261,0.908261,0.907114,0.859202,0.749265,0.67811,0.610938,0.459711,0.350335,0.217985],Bool[true,true,true,true,true,false,false,false,false,false  …  false,false,false,false,false,true,true,false,true,false],13,30,17,1:29,[1,2,3,4,5,5,5,5,5,5  …  10,10,10,10,10,10,11,12,12,13],[17,17,17,17,17,16,15,14,13,12  …  7,6,5,4,3,2,2,2,1,1],[0,0,0,0,0,1,2,3,4,5  …  10,11,12,13,14,15,15,15,16,16],[12,11,10,9,8,8,8,8,8,8  …  3,3,3,3,3,3,2,1,1,0],[0.0,0.0,0.0,0.0,0.0,0.0588235,0.117647,0.176471,0.235294,0.294118  …  0.588235,0.647059,0.705882,0.764706,0.823529,0.882353,0.882353,0.882353,0.941176,0.941176],[0.0769231,0.153846,0.230769,0.307692,0.384615,0.384615,0.384615,0.384615,0.384615,0.384615  …  0.769231,0.769231,0.769231,0.769231,0.769231,0.769231,0.846154,0.923077,0.923077,1.0])
In [161]:
ROC.plot(rc2)
Out[161]:
In [162]:
AUC(rc2)
Out[162]:
0.5701357466063349

Based on all of the above analysis, we can conclude that the random forest classifier works best for this particular problem.

Regression

Let's now do the same for the prediction of the MoneySpent variable.

In [163]:
join(names(train_data), " + ")
Out[163]:
"Feat1_3 + Feat1_4 + Feat1_5 + Feat1_6 + Feat1_7 + Feat2_III + Feat2_IV + Feat2_V + Feat2_VI + Feat2_VII + Feat3_F + Feat3_M + Feat4_ActiveUser + Feat4_Moderator + Feat4_NormalUser + Age + MoneySpent + Is_App + Is_Book + Is_None + Is_VideoGame"
In [164]:
reg_model1 = GLM.fit(LinearModel, MoneySpent ~ Feat1_3 + Feat1_4 + Feat1_5 + Feat1_6 + Feat1_7 + Feat2_III + Feat2_IV + Feat2_V + Feat2_VI + Feat2_VII + Feat3_F + Feat3_M + Feat4_ActiveUser + Feat4_Moderator + Feat4_NormalUser + Age, train_data)
Out[164]:
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}},Float64}

Formula: MoneySpent ~ 1 + Feat1_3 + Feat1_4 + Feat1_5 + Feat1_6 + Feat1_7 + Feat2_III + Feat2_IV + Feat2_V + Feat2_VI + Feat2_VII + Feat3_F + Feat3_M + Feat4_ActiveUser + Feat4_Moderator + Feat4_NormalUser + Age

Coefficients:
                     Estimate  Std.Error  t value Pr(>|t|)
(Intercept)       -1.78308e17 1.63228e17 -1.09238   0.2796
Feat1_3            1.36686e17 1.77525e17 0.769956   0.4447
Feat1_4            1.36686e17 1.74519e17 0.783219   0.4370
Feat1_5            1.36686e17 1.77629e17 0.769506   0.4450
Feat1_6            1.36686e17 1.78527e17 0.765635   0.4473
Feat1_7            1.36686e17  1.8221e17 0.750156   0.4565
Feat2_III          1.85318e17 1.38014e17  1.34274   0.1851
Feat2_IV           1.85318e17 1.35242e17  1.37027   0.1764
Feat2_V            1.85318e17 1.35242e17  1.37027   0.1764
Feat2_VI           1.85318e17 1.38014e17  1.34274   0.1851
Feat2_VII          1.85318e17 1.34733e17  1.37544   0.1748
Feat3_F            1.21665e17 6.86392e16  1.77253   0.0821
Feat3_M            1.21665e17 7.16466e16  1.69813   0.0953
Feat4_ActiveUser  -2.65361e17 1.45499e17 -1.82379   0.0738
Feat4_Moderator   -2.65361e17 1.45499e17 -1.82379   0.0738
Feat4_NormalUser  -2.65361e17 1.45499e17 -1.82379   0.0738
Age                   5.35697     3.1646  1.69278   0.0964

Most of these factors in our regression model fail to pass the significance test, so they'll have to go. The Feature 2 factors seem to be significant, even though they have a very high coefficient, so let's look into them more.

In [165]:
reg_model2 = GLM.fit(LinearModel, MoneySpent ~ Feat2_III + Feat2_IV + Feat2_V + Feat2_VI + Feat2_VII + Age, train_data)
Out[165]:
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}},Float64}

Formula: MoneySpent ~ 1 + Feat2_III + Feat2_IV + Feat2_V + Feat2_VI + Feat2_VII + Age

Coefficients:
                Estimate  Std.Error   t value Pr(>|t|)
(Intercept)   1.01655e17 1.40459e17  0.723734   0.4719
Feat2_III    -1.01655e17 1.40459e17 -0.723734   0.4719
Feat2_IV     -1.01655e17 1.40459e17 -0.723734   0.4719
Feat2_V      -1.01655e17 1.40459e17 -0.723734   0.4719
Feat2_VI     -1.01655e17 1.40459e17 -0.723734   0.4719
Feat2_VII    -1.01655e17 1.40459e17 -0.723734   0.4719
Age              5.35635    2.72344   1.96675   0.0536

Clearly these aren't the factors we are looking for. Let's move along now, to an even simpler model...

In [166]:
reg_model3 = GLM.fit(LinearModel, MoneySpent ~ Age, train_data)
Out[166]:
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}},Float64}

Formula: MoneySpent ~ 1 + Age

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   164.279   89.0075 1.84567   0.0693
Age            4.1754   2.42774 1.71987   0.0900
In [167]:
w = coef(reg_model3) # weights
Out[167]:
2-element Array{Float64,1}:
 164.279 
   4.1754
In [168]:
reg_pred = test_data[:Age] * w[2] + w[1]
Out[168]:
30-element DataArrays.DataArray{Float64,1}:
 256.137
 289.541
 302.067
 289.541
 310.418
 352.172
 318.768
 381.399
 331.295
 297.891
 289.541
 331.295
 277.014
   ⋮    
 306.242
 297.891
 364.698
 322.944
 414.802
 314.593
 247.787
 314.593
 289.541
 331.295
 297.891
 331.295

Time to evaluate the results. In this section we'll use the mean squared error (MSE) metric, which is the most popular way to assess a regression model.

In [169]:
MSE(t::DataArray{Float64, 1}, y::DataArray{Float64, 1}) = mean((t-y).^2)
Out[169]:
MSE (generic function with 1 method)
In [170]:
MSE(test_data[:MoneySpent], reg_pred)
Out[170]:
23239.837621415227

Based on the above, we can conclude that the Age feature is the only one that's actually relevant in predicting the amount of money spent by one of our members.

Information Distillation

Data Product Creation

For this stage, there is not much we can do in Julia. However, we decide to use a GUI development software that makes use of our Julia model on the backend. The product becomes a success.

Insight, Deliverance, and Visualization

After the 1-on-1 meeting with Mrs. Jones, it became apparent that one thing that's very useful for the her, as well as the other stakeholders in this project, is the value of the different data used in our models. This can lead her to make important decisions about how we will organize the next release of this project and provide the marketing team with the requirements of the next version of the questionnaire. So, without wasting too much time we set off to deliver this useful insight, in a quite visual way.

In [171]:
Z = DataFrame(Any, length(ind), 2)
c = 0

for i in ind
    c += 1
    Z[c, 1] = names(df)[i]
    Z[c, 2] = SSI(convert(Array,df[:,i]), convert(Array, df[:FavoriteProduct]))[1]
    println(Z[c, 1], ": ", round(Z[c, 2], 3)[1])
end
MoviesWatched: 0.219
FavoriteEpisode: 0.266
Age: 0.374
Gender: 0.304
BlogRole: 0.288
IsMoviesWatched6: 0.254
IsFavoriteEpisode7: 0.324
IsFemale: 0.304
In [172]:
rename!(Z, symbol("x1"), symbol("Feature"));
rename!(Z, symbol("x2"), symbol("Favorite_Product_Similarity"));
In [173]:
visual = plot(Z, x="Feature", y="Favorite_Product_Similarity", Geom.bar, Scale.y_continuous(minvalue=0.2, maxvalue=0.4))
Out[173]:
Feature MoviesWatched FavoriteEpisode Age Gender BlogRole IsMoviesWatched6 IsFavoriteEpisode7 IsFemale -0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.0 0.2 0.4 0.6 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56 0.58 0.60 0.62 Favorite_Product_Similarity
In [174]:
draw(PNG("CftF final visual.png", 30cm, 15cm), visual)

So, in the future it would make sense to have a question like "Is episode VII your favorite episode?", along with the age and gender ones, while possibly adding some new questions.